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We investigate effects of ordering in blocked matrix-matrix multiplication. We find that sub- 
matrices do not have to be stored contiguously in memory to achieve near optimal performance. 
Instead it is the choice of execution order of the submatrix multiplications that leads to a speedup 
of up to four times for small block sizes. This is in contrast to results for single matrix elements 
showing that contiguous memory allocation quickly becomes irrelevant as the blocksize increases. 



I. INTRODUCTION 

In current state-of-the-art algorithms for large scale 
electronic structure calculations probably the most im- 
portant operation is sparse matrix-matrix multiplica- 
tion. To name but a few important applications, sparse 
matrix-matrix multiplication is a computational ker- 
nel of: (1) density matrix purification 1-8 , (2) density 
matrix minimization 8-14 (3) density matrix perturba- 
tion theory 15-17 , (4) computation of interior eigenpairs 
of potential matrices 18-20 , and (5) time-dependent re- 
sponse calculations 21-24 . Since matrices occurring in 
electronic structure calculations often have a natural 
blocked structure arising from local atom centered ba- 
sis functions, performance can be dramatically improved 
by using a blocked data structure 25-28 . In addition, the 
sparse blocked matrix-matrix multiply is important in 
many other fields as for example the evaluation of ma- 
trix functions including the matrix exponential, the ma- 
trix inverse 29 , inverse factorizations 30-33 , and multigrid 
methods 34 where local blocking may occur. 

CPUs and memory of modern computer architectures 
work at different speeds. Main memory operates at lower 
speeds to reduce the price and power consumption of 
computers. Direct access to main memory causes the 
CPU to stall for several, sometimes hundreds of cycles. 
To achieve decent computer performance cache memory 
that store frequently accessed data was introduced into 
modern CPU designs. Cache memory works at speeds 
comparable with the CPU but typically has only a size 
of about 0.1% to 1% of computer main memory. The 
cache stores data in chunks, so-called cache lines, which 
are usually on the order of tens of bytes long. A memory 
manager controls storing and evicting data from these 
cache lines using heuristics (a typical algorithm called 
Least Recently Used (LRU), evicts the least recently 
used cache line when trying to store a new one) and 
it prefetches subsequent lines when sequential memory 
access is detected or explicit machine language instruc- 
tions are given to the CPU. When implementing a data 
intensive algorithm, both of these cache features - LRU 
eviction and hardware prefetching - can be used to opti- 
mize performance. 

Previous research into ordering effects for the matrix- 
matrix multiply has been focused on either the dense 



or the very sparse case. In the dense case, block re- 
cursive algorithms have drawn much attention 35-38 . In 
the case of sparse matrices, Toledo 39 found better than 
2x speedups with the use of locality enhancing orderings 
based on space filling curves. Here, we are interested 
in the intermediate case of matrix-matrix multiplication 
involving sparse matrices with local blocking, as occurs 
with local basis functions, finite clement methods 40 or 
reordering schemes 41-43 . 

The performance of blocked matrix-matrix multiplica- 
tion depends on (1) the performance of block operations 
and (2) how blocks are stored and accessed. Block oper- 
ations can be delegated to some standard linear algebra 
library optimized for the particular platform 44-48 . Here, 
we focus on (2), exploring the effects of orderings for 
small blocks, consistent with our interest in the interme- 
diate case of locally blocked sparse matrices. 

This article is organized as follows: In section II we 
describe in more detail the locality issues we are address- 
ing. In section III we discuss our results and finally, in 
section V we conclude. 



II. DATA LOCALITY 

Data locality is known to be important for achiev- 
ing good performance of matrix multiplication on mod- 
ern computer architectures 39 . In addition, Translation 
Lookaside Buffer (TLB) misses can significantly impact 
performance 45 . In this study we will focus on the locality 
problem. 

A. Effects of ordering on performance 

We divide an N x N matrix into submatrices of size 
b x b, where b is the blocksize. We choose b so that the 
resulting blocked matrix will consist of n x n submatrices. 
We assume for simplicity that n b = N . The matrix 
product can be written as a combination of submatrix 
products, 



C, ^A ik B kj {i,je[l..n}}- (1) 

fc=i 



block size 



perfect locality 
no locality 



FIG. 2: Left panel: The Peano curve ordering of the elements 
of a 3 x 3 matrix block. Right panel: Recursive construction 
of Peano curve ordering. 



FIG. 1: A qualitative illustration of the performance of a 
blocked matrix multiplication with perfect locality and no lo- 
cality. 

The n 3 products may be evaluated in any order and we 
are free to choose the precise ordering of execution and 
the allocation of the matrix blocks. Algorithm 1 illus- 
trates this point. 



Algorithm 1 General matrix multiplication algorithm 

Allocate n 2 submatrix blocks for A, B, and C . 

for all (i,j,k) in blocked matrix product of eq. (1) do 

Multiply blocks Ait and Bkj- 

Add result to block dj. 
end for 



Considering that we use a highly optimized matrix 
multiplication function on the block level, what perfor- 
mance gain, if any, can we hope to achieve by arrang- 
ing the blocks in a particular order? Clearly, we can 
construct the two limiting cases easily, "perfect locality" 
and perfect non-locality or "no locality" . We construct 
two tests: In the first test we multiply the same 2 blocks 
(e.g. An and Bn to get Cn) n 3 times; in this way 
perfect locality is obtained since the computer's mem- 
ory manager can keep the three submatrices in cache 
throughout the whole operation. In the second test we 
randomize the multiplication and the allocation order of 
the blocks. This will break most of the locality since 
memory prefetches are only possible in the rare event 
that two blocks are close together in memory and the 
time between two multiplication steps in which a partic- 
ular block is reused is very long and will almost certainly 
lead to a cache miss. If ordering effects are significant 
for performance, we expect to find results qualitatively 
similar to those shown in Fig. 1. 

B. Space filling curves for optimized locality 

It is well known that ordering matrix elements in mem- 
ory along locality preserving space filling curves improves 
the performance of matrix operations due to memory 



subsystem design issues 9 . Bader and Zcngcr 36 applied 
this idea to dense matrix-matrix multiplication. They 
devised a block recursive scheme which allocates the ma- 
trix elements along a Peano curve 50 and reorders the 
multiplications of matrix elements to optimize locality. 
They point out that such a scheme is cache oblivious 
and platform independent. Compared to the standard li- 
brary MKL 47 , Bader and Zcnger find that the reordered 
matrix-matrix multiplication yields competitive perfor- 
mance when processor specific optimization techniques, 
e.g. Intel's Streaming Single Instruction, Multiple Data 
Extensions (SSE), are turned off. In the following we 
will apply the idea of improving locality by ordering to 
blocked matrices with blocks larger than single matrix 
elements. 

Figure 2 illustrates the recursive construction of the 
Peano curve ordering of the matrix elements. This de- 
fines the matrix element index ordering and also the or- 
dering of the matrix elements in memory. The Peano 
ordering does not uniquely define a multiplication order 
and we chose the multiplication order that maximizes lo- 
cality according to Bader and Zenger 35 ' 36 . 



C. Temporal vs. spatial locality 

Locality can be divided into two types: spatial and 
temporal locality. Spatial locality is the kind of locality 
one achieves by allocating data contiguously in memory. 
Such locality takes advantage of the hardware prefetch - 
after operations on a block are finished, the next one can 
be found ready in the cache. Temporal locality on the 
other hand means that data needed is already present in 
cache because it was used in a previous computational 
step and does not have to be loaded from memory. The 
impact of this locality is related to the cache line manage- 
ment algorithm (e.g. LRU). From a programmer's point 
of view, spatial locality concerns may influence the choice 
of the submatrix allocation method, whereas temporal 
locality concerns may influence the choice of execution 
order of the matrix multiplication. 

It is reasonable to assume that by ordering the blocked 
matrix product along a Peano curve, we optimize both 
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FIG. 3: Opteron 248: Comparison of the blocked matrix mul- 
tiplication performance for different ordering. The dense case 
is shown as reference and represents the performance of a 
dense matrix multiplication. 



FIG. 4: Xeon Woodcrest: Comparison of the performance of 
Peano curve ordering and Peano curve multiplication ordering 
with no spatial locality. 



spatial and temporal locality, just as is the case for sin- 
gle matrix elements 35,36 . In the following we refer to the 
case in which both spatial and temporal locality arc op- 
timized along a Peano curve as "temporal and spatial 
locality" . We want to separate the two types of locality 
and understand how each of them affects performance. 

We can destroy spatial locality by avoiding any kind of 
contiguous memory allocation during the matrix block al- 
location. Elements within the matrix blocks arc of course 
still allocated in contiguous memory since we want to 
be able to multiply matrix blocks by calling a standard 
generalized matrix-matrix multiply (gemm). However, we 
randomize the allocating order of the submatrix blocks, 
which makes it unlikely that two consecutive blocks are 
close to each other in memory, and break in this way con- 
tiguous allocation on the inter-block level. Since alloca- 
tion order does not affect the multiplication order, and 
therefore temporal locality, we can measure the effect of 
temporal locality by itself and compare with our result 
for full Peano curve ordering. In the following we will 
refer to this non-contiguous case as "temporal locality" . 
We expect the performance to lie somewhere between the 
2 idealized curves indicated in Fig. 1. 



III. RESULTS 

In the previous section we discussed four different 
strategies of computing a blocked matrix-matrix prod- 
uct with different data locality features: "perfect local- 
ity" , "temporal and spatial locality" , "temporal local- 
ity" , and "no locality" . Here we present the performance 
for the blocked matrix-matrix multiplication with these 
locality features for block sizes ranging from 3 to 200, 
on two different computer architectures. All calculation 
were performed on a single CPU. 

The results shown in Fig. 3 were obtained on an AMD 
Opteron 248 system clocked at 2.2 GHz with 8 GiB 51 
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FIG. 5: Opteron 248: Speedup of blocked matrix multipli- 
cation compared to the case of no locality. Speedup is given 
in percentage and a speedup of 100% means a doubling of 
performance. 



200 



150 



100 



50 



temporal locality (Peano) 
temporal and spatial locality (Peano) 
,l perfect locality 




10 



20 30 40 
block size 



50 



60 



FIG. 6: Xeon Woodcrest: Speedup of blocked matrix multi- 
plication. 
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of main memory, using the GotoBLAS 45 library. The 
results shown in Fig. 4 were obtained on an Intel Xcon 
Woodcrest 5150 system clocked at 2.66 GHz with 8 GiB 
of main memory, using GotoBLAS 45 . 

We generate 2 random square matrices of size N x N, 
where 2000 < N < 6000. These matrices are blocked 
into submatrices of size 6x6, the block size. The type of 
test determines the allocation method of the submatrix 
blocks and the execution order of the multiplication. We 
time the multiplication with the CPU time as reported 
by the operating system. The performance is defined as 
P = (2A 3 + 4A 2 ) /t, where t is the CPU time measured, 
taking into account one multiplication and one addition 
per term in eq. (1), one memory read operation per ma- 
trix element, and one write operation per matrix clement 
of the C matrix. We repeat each test 30 times to average 
out any fluctuations in the timing. 

On both platforms, we find that ordering has a pro- 
found effect on performance. The difference between a 
blocked multiplication with perfect locality and without 
locality is significant (see section IIA for a definition of 
the terms "perfect locality" and "no locality"). This 
point is emphasized in Figs. 5 and 6, which show the 
relative speedup achieved over the case of "no locality" . 
The performance of the dense matrix test is independent 
of the matrix size for TV > 2000. As a reference we also 
show in Figs. 3 and 4 the performance of the gemm() li- 
brary call in the large matrix limit. The "dense" result 
has to be seen as an upper limit of what can be achieved 
for large matrices. Due to the fact that libraries such as 
GotoBLAS tune their performance with regards to large 
matrices, we should expect a drop in performance for 
small matrices. This is the reason why our results indi- 
cate that a blocked approach becomes less efficient as the 
block size decreases. 

We clearly find that full Peano curve ordering achieves 
a performance which is close to perfect locality. This 
agrees with earlier findings by Bader and Zenger 36 . In 
the case of temporal locality, we find that the matrix 
multiplication achieves near perfect locality performance 
as well ("temporal locality" refers to the case where we 
separately allocate submatrix blocks so that they are not 
contiguous in memory, see section II C). The relative 
speedup achieved by temporal ordering is almost iden- 
tical to the speedups achieved by Peano curve ordering 
or the case of perfect locality. This is in contrast to the 
results for single matrix elements. This shows that spa- 
tial locality becomes quickly irrelevant as the blocksizc 
increases. We believe that this result has not been ap- 
preciated until now. 



IV. DISCUSSION 

A closer look at how modern computer memory man- 
agers operate reveals that one might expect our find- 
ing that performance of matrix-matrix multiplication is 
controlled by temporal locality, and not spatial locality. 



As pointed out in the introduction, the cache is orga- 
nized in cache lines of fairly limited size. In the case of 
the Opteron 248, a cache line contains 64 bytes. The 
Opteron's memory manager will prefetch cache line n + 3 
when it notices access to cache line n, followed by access 
to cache line n+ 1, which is true also for the case of a de- 
scending access pattern. The total size of a typical cache 
is much larger than the size of a single cache line and is 
typically on the order of 0.1% to about 1% of main mem- 
ory. In the case of the Opteron 248, the size of cache is 1 
MiB. The size of cache of the Xeon Woodcrest architec- 
ture on the other hand is 4 MiB, which is, at least partly, 
responsible for the delayed rise of the multiplication per- 
formance when compared to the Opteron. To remind the 
reader, we refer to spatial locality as having to do with 
prefetching and temporal locality as referring to data be- 
ing reused and therefore already present in cache. We 
conclude that cache line size and memory prefetch are 
the relevant features regarding spatial locality and total 
cache size is what matters for temporal locality. Given 
that a cache line is very small we do not expect hardware 
prefetch effects and therefore spatial locality to matter for 
the submatrix sizes we studied. Temporal locality, on the 
other hand, is important up to relatively large submatrix 
blocks given the different length scale of the total cache 
size. 



V. CONCLUSIONS 

We find that ordering effects for blocked matrix multi- 
plications is important for performance confirming earlier 
findings by other researchers 26 ' 35 ' 36 . The performance 
gain is due to increased locality. By breaking down the 
block locality into two types, spatial and temporal local- 
ity, we find that temporal locality gives near perfect lo- 
cality by itself and that spatial locality can be neglected. 
This has important implications for the implementation 
of any blocked matrix multiplication method. The pro- 
grammer should worry about the execution order of the 
multiplication, but can safely ignore any concerns re- 
garding contiguous memory allocation. This allows for 
greater flexibility for blocked matrix data structures. 
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